Pandas and NumPy

A pandas Series is a table with one column, a row index, and a name.

Series

pandas automatically identifies the data type (dtype) of a Series, which we can see by just printing it out
import pandas as pd
s = pd.Series()
s
Mixed types are not a problem for pandas.Series, as it will convert them all to the object dtype
We can access a particular value in a Series using loc or iloc, which select the value by its index label or by its integer position, respectively
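
The points above can be sketched with a few made-up values. The names `nums` and `mixed` and the labels 'a', 'b', 'c' are assumptions chosen only for illustration:

```python
import pandas as pd

# Integers only: pandas infers an integer dtype (typically int64).
nums = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='nums')
print(nums.dtype)

# Mixed values: everything is stored as generic Python objects.
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)

# loc selects by index label, iloc by integer position.
by_label = nums.loc['b']
by_position = nums.iloc[1]
print(by_label, by_position)
```

Here both lookups return the same value, because the label 'b' sits at position 1.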

Vectorization

pandas is also very fast at doing computations over a Series.
For example, compare summing a Series of numbers with a for loop against doing it with np.sum(s).
The time difference is huge: on a large Series the vectorized call is often orders of magnitude faster.
This can also be done for other types of data, wherever a vectorized function exists.
This is because of vectorization and parallel programming (more on this).

It is very important to know which function to use where for faster computation.
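
A rough way to see the difference is to time both approaches on the same data. This is only a sketch: the Series size is an arbitrary choice, and the timings depend on the machine, so no specific numbers are claimed here.

```python
import time

import numpy as np
import pandas as pd

# One million integers; int64 is requested explicitly so the sum cannot overflow.
s = pd.Series(np.arange(1_000_000, dtype=np.int64))

# Plain Python loop: one interpreter step per element.
start = time.perf_counter()
loop_total = 0
for value in s:
    loop_total += value
loop_seconds = time.perf_counter() - start

# Vectorized: the loop runs inside compiled code.
start = time.perf_counter()
vec_total = np.sum(s)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, np.sum: {vec_seconds:.4f}s")
```

Both approaches produce the same total; only the time taken differs.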

DataFrame

A DataFrame is equivalent to a table in a database, or a collection of Series.
Just like a Series, it has an index. Since it holds multiple Series, it also has multiple Series names, which are called column names

Row Name/Column Name	Name 1	Name 2	Name 3
0	V1	V2	v3
1	V1	V2	v3

Just like with a Series, the values can be accessed using loc and iloc.
Both take a row and a column, or a list of rows and a list of columns: loc by name (label) and iloc by number (integer position).
Adding a column to a DataFrame is as easy as assigning values to a new column name.
We can create a DataFrame using pandas.DataFrame(), which takes an iterable object.
This iterable can be, for example, a list or an index

lis = [1,2,3,4,5,6]
s = pd.Series(lis,name='A',index=['Zero','One','Two','Three','Four','Five'])
df = pd.DataFrame(lis,columns=['A'],index=['Zero','One','Two','Three','Four','Five'])
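
Continuing from the DataFrame built above, here is a small sketch of adding a column and selecting values with loc and iloc. The column name 'B' and the helper list `labels` are assumptions made up for this example:

```python
import pandas as pd

lis = [1, 2, 3, 4, 5, 6]
labels = ['Zero', 'One', 'Two', 'Three', 'Four', 'Five']
df = pd.DataFrame(lis, columns=['A'], index=labels)

# Add a column by assigning to a new column name.
df['B'] = df['A'] * 10

# loc uses labels: a single cell, then a list of rows and a list of columns.
cell = df.loc['Two', 'A']
block = df.loc[['Zero', 'Five'], ['A', 'B']]

# iloc uses integer positions for the same selections.
same_cell = df.iloc[2, 0]
same_block = df.iloc[[0, 5], [0, 1]]

print(cell, same_cell)
print(block)
```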

>Usually we work on a dataset that we convert to a DataFrame to perform analysis on. Such files can be CSV, Excel, or plain text, and we need to make sense of them.

o = pd.read_csv('olympics.csv')
o.head()

We can see that we have an unwanted index and unwanted column names; what we actually need is the first row as the column names and the first column as the index.
We can do this by taking advantage of the parameters of **read_csv()**:

o = pd.read_csv('olympics.csv',skiprows=1,index_col=0)
o.head()

> This gives a much better result, but not entirely. We can still see that some column names do not make sense or are ambiguous. Let's do a little more formatting.

for col in o.columns:
    if col[:2] == '01': o.rename(columns={col:'Gold'+col[5:]},inplace=True)
    if col[:2] == '02': o.rename(columns={col:'Silver'+col[5:]},inplace=True)
    if col[:2] == '03': o.rename(columns={col:'Bronze'+col[5:]},inplace=True)
    if col[:1] == '№': o.rename(columns={col:'#'+col[2:]},inplace=True)
o.head()

R/C	#Summer	Gold	Silver	Bronze	Total	#Winter	#Games	Gold2	Silver2	Bronze2	Combined total
Afghanistan (AFG)	13	0	0	2	2	0	13	0	0	2	2
Algeria (ALG)	12	5	2	8	15	3	15	5	2	8	15
Argentina (ARG)	23	18	24	28	70	18	41	18	24	28	70
Armenia (ARM)	5	1	2	9	12	6	11	1	2	9	12
Australasia (ANZ) [ANZ]	2	3	4	5	12	0	2	3	4	5	12

This is much better and more understandable. Now we can use it for further analysis.
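
The loading and renaming steps above can be tried without the real file. The sketch below uses a made-up in-memory stand-in for olympics.csv: the country names, the values, and the exact header strings are assumptions, not the real data. It also passes a function to rename instead of renaming in place inside a loop, which is an alternative to the approach shown above; `clean_name` is a hypothetical helper.

```python
import io

import pandas as pd

# Made-up stand-in for olympics.csv: a junk first row, then coded headers.
raw = (
    "0,1,2,3,4\n"
    ",№ Summer,01 !,02 !,03 !\n"
    "Aland (AAA),3,1,2,3\n"
    "Borduria (BBB),5,4,5,6\n"
)

# skiprows=1 drops the junk row; index_col=0 makes the first column the index.
o = pd.read_csv(io.StringIO(raw), skiprows=1, index_col=0)


def clean_name(col):
    """Hypothetical helper: map a coded header to a readable one."""
    if col.startswith('01'):
        return 'Gold' + col[5:]
    if col.startswith('02'):
        return 'Silver' + col[5:]
    if col.startswith('03'):
        return 'Bronze' + col[5:]
    if col.startswith('№'):
        return '#' + col[2:]
    return col


# rename accepts a function and returns a new DataFrame,
# so nothing is modified while the columns are being looped over.
o = o.rename(columns=clean_name)
print(o.columns.tolist())
print(o)
```

Returning a new DataFrame from rename keeps each step easy to inspect, at the cost of one extra assignment.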

Boolean Masking

Boolean masking is one way to query our DataFrame

o['Silver'] >= 5

Afghanistan (AFG)          False
Algeria (ALG)              False
Argentina (ARG)             True
Armenia (ARM)              False
Australasia (ANZ) [ANZ]    False
Name: Silver, dtype: bool

For example, the expression above flags every country that has won 5 or more silver medals.
The comparison is broadcast to every value in the o['Silver'] Series and returns a boolean Series of the same length.
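
A boolean Series like this is usually passed back into the DataFrame to keep only the matching rows. Below is a minimal sketch on three rows taken from the table above. The variable names `mask`, `five_plus`, and `gold_under_five_silver` are made up for the example:

```python
import pandas as pd

# Three rows from the table above; the real `o` has more rows and columns.
o = pd.DataFrame(
    {'Gold': [0, 5, 18], 'Silver': [0, 2, 24]},
    index=['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)'],
)

# Boolean Series with one True/False value per row.
mask = o['Silver'] >= 5

# Indexing with the mask keeps only the rows where it is True.
five_plus = o[mask]

# Masks combine with & (and) and | (or); each condition needs parentheses.
gold_under_five_silver = o[(o['Gold'] > 0) & (o['Silver'] < 5)]

print(five_plus.index.tolist())
print(gold_under_five_silver.index.tolist())
```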